Abstract

Tackling pattern recognition problems in areas such as computer vision, bioinformatics, speech or text recognition 
is often done best by taking into account task-specific statistical relations between output variables. In structured 
prediction, this internal structure is used to predict multiple outputs simultaneously, leading to more accurate and 
coherent predictions. Structural support vector machines (SSVMs) are nonprobabilistic models that optimize a joint 
input-output function through margin-based learning. Because SSVMs generally disregard the interplay between unary
and interaction factors during the training phase, the final parameters are suboptimal. Moreover, their factors are often
restricted to linear combinations of input features, limiting their generalization power. To improve prediction accuracy,
this paper proposes: (i) Joint inference and learning by integration of back-propagation and loss-augmented inference 
in SSVM subgradient descent; (ii) Extending SSVM factors to neural networks that form highly nonlinear functions 
of input features. Image segmentation benchmark results demonstrate improvements over conventional SSVM training 
methods in terms of accuracy, highlighting the feasibility of end-to-end SSVM training with neural factors. 

Keywords: structural support vector machine, neural factors, structured prediction, neural networks, image 
segmentation 


1. Introduction 

In traditional machine learning, the output consists of a single scalar, whereas in structured
prediction, the output can be arbitrarily structured. Structured prediction models have proven
useful in tasks where output interactions play an important role. Examples are image segmentation,
part-of-speech tagging, and optical character recognition, where taking into account contextual
cues and predicting all output variables at once is beneficial. A widely used framework is the
conditional random field (CRF), which models the statistical conditional dependencies between
input and output variables, as well as among the output variables themselves. However, many tasks
only require ‘most-likely’ predictions, which led to the rise of nonprobabilistic approaches.
Rather than optimizing the Bayes’ risk, these models minimize a structured loss, allowing the
optimization of performance indicators directly [1]. One such model is the structural support
vector machine (SSVM) [5], in which a generalization of the hinge loss to multiclass and
multilabel prediction is used.

A downside to traditional SSVM training is the bifurcated training approach in which unary
factors (dependencies of outputs on inputs) and interaction factors (mutual output dependencies)
are trained sequentially. A unary classification model is optimized first, while the interactions
are trained post-hoc. However, this two-phase approach is suboptimal, because the errors made
during the training of the interaction factors cannot be accounted for during training of the
unary classifier. Another limitation is that SSVM factors are linear feature combinations,
restricting the SSVM’s generalization power. We propose to extend these linearities to highly
nonlinear functions by means of multilayer neural networks, to which we refer as neural factors.
Towards this goal, subgradient descent is extended by combining loss-augmented inference with
back-propagation of the SSVM objective error into both unary and interaction neural factors. This
leads to better generalization and more synergy between both SSVM factor types, resulting in more
accurate and coherent predictions.

Our model is empirically validated by means of the complex structured prediction task of image
segmentation on the MSRC-21, KITTI, and SIFT Flow benchmarks. The results demonstrate that
integrated inference and learning, and/or using neural factors, improves prediction accuracy over
conventional SSVM training methods, such as N-slack cutting plane and subgradient descent
optimization [1]. Furthermore, we demonstrate that our model is able to perform on par with
current state-of-the-art segmentation models on the MSRC-21 benchmark.





2. Related work 

Although the combination of neural networks and structured or probabilistic graphical models
dates back to the early ’90s [?], interest in this topic is resurging. Several recent works
introduce nonlinear unary factors/potentials into structured models. For the task of image
segmentation, Chen et al. train a convolutional neural network as a unary classifier, followed by
the training of a dense random field over the input pixels. Similarly, Farabet et al. [6] combine
the output maps of a convolutional network with a CRF for image segmentation, while Li and Zemel
[7] propose semisupervised max-margin learning with nonlinear unary potentials. Contrary to these
works, we trade the bifurcated training approach for integrated inference and training of unary
and interaction factors. Several works [?] focus on linear-chain graphs, using an independently
trained deep learning model whose output serves as unary input features. Contrary to these works,
we focus on more general graphs. Other works suggest kernels towards nonlinear SSVMs [13, ?]; we
approach nonlinearity by representing SSVM factors by arbitrarily deep neural networks.

Do and Artieres [?] propose a CRF in which potentials are represented by multilayer networks. The
performance of their linear-chain probabilistic model is demonstrated by optical character and
speech recognition using two-hidden-layer neural network outputs as unary potentials.
Furthermore, joint inference and learning in linear-chain models is also proposed by Peng et al.
[15]; however, the application to more general graphs remains an open problem [?]. Contrary to
these works, we propose a nonprobabilistic approach for general graphs by also modeling nonlinear
interaction factors. More recently, Schwing and Urtasun [?] train a convolutional network as a
unary classifier jointly with a fully-connected CRF for the task of image segmentation, similar
to [?]. Chen et al. [?] advocate a joint learning and reasoning approach, in which a structured
model is probabilistically trained using loopy belief propagation for the task of optical
character recognition and image tagging. Other related work includes Domke [?], who uses
relaxations for combined message-passing and learning.

Other works aiming to improve conventional SSVMs are those of Wang et al. [?] and Lin et al.
[23], in which a hierarchical part-based model is proposed for multiclass object recognition and
shape detection, focusing on model reconfigurability through compositional alternatives in And-Or
graphs. Liang et al. [?] propose the use of convolutional neural networks to model an end-to-end
relation between input images and structured outputs in active template regression. Xu et al.
[25] propose the learning of a structured model with multilayer deformable parts for action
understanding, while Lu et al. [26] propose a hierarchical structured model for action
segmentation.

Many of these works use probabilistic models that minimize the negative log-likelihood, such as
[?]. In contrast, this paper takes a nonprobabilistic approach, wherein an SSVM is optimized via
subgradient descent. The algorithm is altered to back-propagate SSVM loss errors, based on the
ground truth and a loss-augmented prediction, into the factors. Moreover, all factors are
nonlinear functions, allowing the learning of complex patterns that originate from interaction
features.

3. Methodology 

In this section, essential SSVM background is introduced, after which integrated inference and
back-propagation is explained for nonlinear unary factors. Finally, this notion is generalized
into an SSVM model using only neural factors, which are optimized by an alteration of subgradient
descent.

3.1. Background 

Traditional classification models are based on a prediction function $f : \mathcal{X} \to \mathbb{R}$
that outputs a scalar. In contrast, structured prediction models define a prediction function
$f : \mathcal{X} \to \mathcal{Y}$, whose output can be arbitrarily structured. In this paper, this
structure is represented by a vector in $\mathcal{Y} = \mathcal{L}^n$, with
$\mathcal{L} \subset \mathbb{N}$ a set of class labels. Structured models employ a compatibility
function $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, parametrized by $w \in \mathbb{R}^D$.
Prediction is done by solving the following maximization problem:

$$f(x) = \arg\max_{y \in \mathcal{Y}} g(x, y; w). \tag{1}$$

This is called inference, i.e., obtaining the most-likely assignment of labels, which is similar
to maximum-a-posteriori (MAP) inference in probabilistic models. Because of the combinatorial
complexity of the output space $\mathcal{Y}$, the maximization problem in Eq. (1) is NP-hard [20].
Hence, it is important to impose on $g$ some kind of regularity that can be exploited for
inference. This can be done by ensuring that $g$ corresponds to a nonprobabilistic factor graph,
for which efficient inference techniques exist [1]. In general, $g$ is linearly parametrized as a
product of a weight vector $w$ and a joint feature function
$\varphi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^D$.

Commonly, $g$ decomposes as a sum of unary and interaction factors¹, in which
$\varphi = [\varphi_U^\top, \varphi_I^\top]^\top$. The functions $\varphi_U$ and $\varphi_I$ are
then sums over all individual joint input-output features of the nodes $\psi_i(y, x)$ and
interactions $\psi_{ij}(y, x)$ of the corresponding factor graph [1, 12]. For example, in the use
case of Section 4, nodes are image regions, while interactions are connections between regions,
each with their own joint feature vector. Data samples $(x, y)$ conform to this graphical
structure, i.e., $x$ is composed of unary features $x^U$ and interaction features $x^I$.
Moreover, the unary and interaction parameters are generally concatenated as
$w = [w_U^\top, w_I^\top]^\top$.

¹Maximizing $g$ corresponds to minimizing the state of a nonprobabilistic factor graph, which
factorizes into a product of factors. However, by operating in the log-domain, the state
decomposes as a sum of factors.

In this formulation, the unary features are defined as

$$\psi_i(y_i) = \big(\psi_i(x)\,[y_i = m]\big)_{m \in \mathcal{L}}, \tag{2}$$

while the interaction features for 2nd-order (edge) interactions are defined as

$$\psi_{ij}(y_i, y_j) = \big(\psi_{ij}(x)\,[y_i = m \wedge y_j = n]\big)_{(m, n) \in \mathcal{L}^2}, \tag{3}$$

with $\psi_i(x)$ the unary features corresponding to node $i$ and $\psi_{ij}(x)$ the interaction
features corresponding to interaction (edge) $(i, j)$. Similarly, higher-order interaction
features can be incorporated by extending this matrix into higher-order combinations of nodes,
according to the interactions. In the experiments of this paper, unary features are bag-of-words
features corresponding to each superpixel. Interaction features are also bag-of-words, but this
time corresponding to all connected superpixels.

In an SSVM the compatibility function is linearly parametrized as
$g(x, y; w) = \langle w, \varphi(x, y) \rangle$ and optimized effectively by minimizing an
empirical estimate of the regularized structured risk

$$R(w) + \frac{\lambda}{N} \sum_{n=1}^{N} \Delta\big(y^n, f(x^n)\big), \tag{4}$$

with $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ a structured loss function for
which holds $\forall y, y' : \Delta(y, y') \geq 0$, $\Delta(y, y) = 0$, and
$\Delta(y', y) = \Delta(y, y')$; $R$ a regularization function; $\lambda$ the inverse of the
regularization strength; for a set of $N$ training samples
$\{(x^n, y^n)\}_{n \in \{1, \ldots, N\}} \subset \mathcal{X} \times \mathcal{Y}$ that can be
decomposed into $V_n$ nodes and $E_n$ interactions. In this paper, we make use of
$L_2$-regularization, hence $R(w) = \frac{1}{2}\|w\|^2$. Furthermore, in line with our image
segmentation use case in Section 4, the loss function is the class-weighted Hamming distance
between two label assignments, or

$$\Delta(y^n, y) = \sum_{i=1}^{V_n} \eta(y_i^n)\,[y_i^n \neq y_i], \tag{5}$$

with $[\cdot]$ the Iverson brackets and $V_n$ the number of nodes (i.e., inputs to the unary
factors, which corresponds to the number of nodes in the underlying factor graph) in the $n$-th
training sample. Contrary to maximum likelihood approaches [?, 20, ?], the Hamming distance
allows us to directly maximize performance metrics regarding accuracy. By setting
$\eta(y_i^n) = 1$ we can focus on node-wise accuracy, while setting
$\eta(y_i^n) = \big(\sum_{j=1}^{V_n} [y_j^n = y_i^n]\big)^{-1}$ allows us to focus on class-mean
accuracy.
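As a minimal sketch of the class-weighted Hamming distance, the following Python function computes the loss for a single sample. The helper names (`class_weights`, `weighted_hamming`) and the `mode` switch are illustrative assumptions, and the class-mean variant assumes that the weight of a node is the inverse of the number of nodes in the sample that carry the same ground-truth label.

```python
import numpy as np


def class_weights(y_true, mode="node"):
    """Per-node weights eta(y_i): all ones, or inverse class frequency within the sample."""
    y_true = np.asarray(y_true)
    if mode == "node":
        return np.ones(len(y_true))
    counts = np.bincount(y_true)       # number of nodes per ground-truth class
    return 1.0 / counts[y_true]        # assumption: inverse in-sample class count


def weighted_hamming(y_true, y_pred, mode="node"):
    """Class-weighted Hamming distance between two label assignments."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    eta = class_weights(y_true, mode)
    return float(np.sum(eta * (y_true != y_pred)))
```

With `mode="node"` every mislabeled node costs 1; with `mode="class"` each class contributes at most 1 in total, so rare classes are not dominated by frequent ones.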

Due to the piecewise nature of the loss function $\Delta$, traditional gradient-based
optimization techniques are ineffective for solving Eq. (4). However, according to Zhang [?], the
equations

$$L(w) = \frac{1}{2}\|w\|^2 + \frac{\lambda}{N} \sum_{n=1}^{N} \max\{\ell(x^n, y^n; w), 0\}, \quad \text{with} \tag{6}$$

$$\ell(x^n, y^n; w) = \max_{y \in \mathcal{Y}} \big[\Delta(y^n, y) - g(x^n, y^n; w) + g(x^n, y; w)\big], \tag{7}$$

define a continuous and convex upper bound for the actual structured risk in Eq. (4) that can be
minimized effectively by solving $\arg\min_{w \in \mathbb{R}^D} L(w)$ through numerical
optimization.
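To make the structured hinge term concrete, the sketch below evaluates it by brute-force enumeration of all label assignments on a tiny pairwise factor graph. This is only feasible for a handful of nodes and is not the inference method used in this paper; the function names and the table-based representation of the factors (`unary`, `pairwise`) are assumptions made for illustration, and the unweighted Hamming distance is used as loss.

```python
import itertools

import numpy as np


def score(unary, pairwise, edges, y):
    """Compatibility g(x, y): sum of unary and pairwise factor values (log-domain)."""
    s = sum(unary[i, y[i]] for i in range(len(y)))
    s += sum(pairwise[y[i], y[j]] for i, j in edges)
    return s


def hamming(y_true, y):
    """Unweighted Hamming distance between two label tuples."""
    return sum(a != b for a, b in zip(y_true, y))


def structured_hinge(unary, pairwise, edges, y_true):
    """Return (hinge value, maximizing assignment) by enumerating the output space."""
    n_nodes, n_labels = unary.shape
    true_score = score(unary, pairwise, edges, y_true)
    best, z = -np.inf, None
    for y in itertools.product(range(n_labels), repeat=n_nodes):
        value = hamming(y_true, y) - true_score + score(unary, pairwise, edges, y)
        if value > best:
            best, z = value, y
    return max(best, 0.0), z
```

The hinge is zero only when the ground truth outscores every other assignment by at least its loss, which is the margin requirement the upper bound encodes.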

3.2. Integrated back-propagation and inference 

Traditional SSVM training methods optimize a joint parameter vector of the unary and interaction
factors. However, they restrict these parameters to linear combinations of input features, or
allow limited nonlinearity through the addition of kernels. The objective function in case of
arbitrary nonlinear factors is often hard to optimize, as many numerical optimization methods
require a convex objective function formulation. For example, N-slack cutting plane training
requires the conversion of the max-operation in Eq. (7) to a set of $N|\mathcal{Y}|$ linear
constraints for its quadratic programming procedure [29]; block-coordinate Frank-Wolfe SSVM
optimization [?] assumes linear input dependencies; the structured perceptron similarly assumes
linear parametrization [?]; and dual coordinate descent focuses on solving the dual of the linear
$L_2$-loss in SSVMs [32].

Subgradient descent minimization, as described in [1, 33], is a flexible tool for optimizing
Eq. (6) as it naturally allows error back-propagation. This algorithm alternates between two
steps. First,

$$z^n = \arg\max_{y \in \mathcal{Y}} \big[\Delta(y^n, y) + \langle w, \varphi(x^n, y) \rangle\big] \tag{8}$$

is calculated for all $N$ training samples, which is called the loss-augmented inference or
prediction step, derived from Eq. (7). In this paper, general inference for determining Eq. (1)
is approximated via the α-expansion [34] algorithm, whose effectiveness has been validated
through extensive experiments [35]. Loss-augmented prediction as in Eq. (8) is incorporated into
this procedure by adding the loss term $\eta(y_i^n)[y_i^n \neq y_i]$ to the unary factors.
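Because the Hamming loss decomposes over nodes, loss augmentation only touches the unary factor table, after which any standard inference routine can be reused unchanged. The sketch below illustrates this under stated assumptions: the function name and the array layout (one row per node, one column per label) are hypothetical.

```python
import numpy as np


def loss_augment_unaries(unary, y_true, eta=None):
    """Add eta_i * [y_i_true != k] to the unary factor value of every node i and label k."""
    unary = np.asarray(unary, dtype=float)
    y_true = np.asarray(y_true)
    n_nodes = unary.shape[0]
    eta = np.ones(n_nodes) if eta is None else np.asarray(eta, dtype=float)
    augmented = unary + eta[:, None]                 # reward every label of node i ...
    augmented[np.arange(n_nodes), y_true] -= eta     # ... except the ground-truth label
    return augmented
```

Running ordinary (approximate) MAP inference on the augmented table then yields the loss-augmented prediction.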

Second, these $z$-values are used to calculate a subgradient² of Eq. (6) as
$w + \lambda\big(\varphi(x^n, z^n) - \varphi(x^n, y^n)\big)$ for each sample $(x^n, y^n)$, in
order to update $w$. Traditional SSVMs assume that $g(x, y; w) = \langle w, \varphi(x, y) \rangle$,
in which $\varphi$ is a predefined joint input-output feature function. Commonly, this joint
function is made up of the outputs of a nonlinear ‘unary’ classifier
$C : \mathcal{X} \to [0, 1]^{|\mathcal{L}|}$ such that $\varphi_U(x, y)$ becomes
$\varphi_U(C(x), y)$ [?]. This classifier is trained upfront, based on the different unary inputs
corresponding to each node in the underlying factor graph. Due to the linear definition of $g$,
the SSVM model is learning linear combinations of these classifier outputs as its unary factors. In

²$s \in \mathbb{R}^D$ is a subgradient of $f : \mathbb{R}^D \to \mathbb{R}$ in a point $p_0$ if
$f(p) - f(p_0) \geq \langle s, p - p_0 \rangle$. Due to its piecewise continuous nature, Eq. (6)
is nondifferentiable in some points, hence we are forced to rely on subgradients.
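Algorithm 1: Integrated SSVM subgradient descent with neural unary and linear interaction factors
Input: # iterations $T$; learning rate curve $\mu/(t_0 + t)$; inverse regularization strength $\lambda$; training samples $\{(x^n, y^n)\}_{n=1}^{N}$
Output: optimized parameters $\theta$ and $w$

 1  Initialize $w$ to 0 and $\theta$ according to [28]; the output layer weights are initialized to 0.
 2  for $1 \leq t \leq T$ do
 3      for $1 \leq n \leq N$ do
 4          $z^n \leftarrow \arg\max_{y \in \mathcal{Y}} [\Delta(y^n, y) + \langle w, \varphi_I(x^n, y) \rangle + f(x^n, y; \theta)]$    // loss-augmented prediction in Eq. (9)
 5          if $\Delta(y^n, z^n) - g(x^n, y^n; \theta, w) + g(x^n, z^n; \theta, w) > 0$ then    // max-operation in Eq. (6)
 6              $\frac{\partial L_n}{\partial w}(\theta, w) \leftarrow w + \lambda(\varphi_I(x^n, z^n) - \varphi_I(x^n, y^n))$    // standard SSVM subgradient computation [?]
 7              $\nabla_\theta L_n(\theta, w) \leftarrow \theta + \lambda(\nabla_\theta f(x^n, z^n; \theta) - \nabla_\theta f(x^n, y^n; \theta))$    // gradient computation as in Eq. (10)
 8          else
 9              $\nabla_\theta L_n(\theta, w) \leftarrow \theta$ and $\frac{\partial L_n}{\partial w}(\theta, w) \leftarrow w$
10          end
11      end
12      $w \leftarrow w - \frac{\mu}{t_0 + t} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial L_n}{\partial w}(\theta, w)$    // update linear interaction factors
13      $\theta \leftarrow \mathrm{backprop}\big(\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta L_n(\theta, w)\big)$    // update neural unary factors via back-propagation
14  end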


general, the interaction factors are not trained through a 
separate classifier, and are thus linear combinations of the 
interaction features directly. 

We propose to replace the pretraining of a nonlinear unary classifier, and the transformation of
its outputs through linear factors, by the direct optimization of nonlinear unary factors. In
particular, the unary part of $g$ is represented by a sum $f$ of outputs of an adapted neural
network which models factor values. To achieve this, the loss-augmented prediction step defined
in Eq. (8) is altered to

$$z^n = \arg\max_{y \in \mathcal{Y}} \big[\Delta(y^n, y) + \langle w, \varphi_I(x^n, y) \rangle + f(x^n, y; \theta)\big], \tag{9}$$

in which $\varphi_I$ represents the joint interaction feature function as described in Section
3.1 and Eq. (3). Eq. (9) is calculated similarly to Eq. (8) through α-expansion by encoding the
loss term into the unary factors.

The compatibility function thus becomes
$g(x, y; \theta, w) = \langle w, \varphi_I(x, y) \rangle + f(x, y; \theta)$. The calculation of
$\partial L / \partial w$, originally defined as the subderivative of the objective function in
Eq. (6), remains unaltered. However, we can no longer assume that $\nabla_\theta L$ conforms to
the definition of a subgradient, due to the nonconvexity of $L$. Nevertheless, we can calculate

$$\nabla_\theta L(\theta, w) = \theta + \frac{\lambda}{N} \sum_{n \in \mathcal{N}} \big(\nabla_\theta f(x^n, z^n; \theta) - \nabla_\theta f(x^n, y^n; \theta)\big), \tag{10}$$

with $\mathcal{N}$ the set of indices corresponding to training samples for which
$\ell(x^n, y^n; w) > 0$ in Eq. (7), for a particular loss-augmented prediction $z^n$. In case
$\ell(x^n, y^n; w) = 0$, we set $\nabla_\theta \ell = 0$. This gradient incorporates the
loss-augmented prediction of Eq. (9), and is back-propagated through the underlying network to
adjust each element of $\theta$. The altered subgradient descent method is shown in Algorithm 1.
Herein, $L_n$ represents the objective function for the $n$-th training sample, i.e.,
$L_n(\theta, w) = \frac{1}{2}\|w\|^2 + \lambda\big(\Delta(y^n, z^n) - g(x^n, y^n; \theta, w) + g(x^n, z^n; \theta, w)\big)$.

In contrast to gradient descent, subgradient methods [?] do not guarantee the lowering of the
objective function value in each step. Therefore, the current best value
$L_*^{(t)} = \min\{L_*^{(t-1)}, L(w^{(t)})\}$ is memorized in each iteration $t$, along with the
corresponding parameter values $(w_*, \theta_*)$. As such, the best objective value $L_*$ never
increases, since $L_*^{(t)} = \min\{L(w^{(1)}), \ldots, L(w^{(t)})\}$. This update rule is
omitted from Algorithm 1 to improve readability.
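The best-iterate bookkeeping described above can be sketched in a few lines of Python. The function name and the list-based interface are assumptions for illustration; in an actual training loop one would also store a copy of the parameters whenever a new best value is found.

```python
def track_best(objective_values):
    """Return the running minimum of the objective and the iteration that attained the best value."""
    best = float("inf")
    best_iteration = None
    trace = []
    for t, value in enumerate(objective_values, start=1):
        if value < best:
            # A new best objective value: this is where (w*, theta*) would be snapshotted.
            best, best_iteration = value, t
        trace.append(best)
    return trace, best_iteration
```

The running minimum is nonincreasing even when individual subgradient steps raise the objective.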

Because the loss terms in Eq. (7) are no longer affine input transformations due to the
introduced nonlinearities of the neural network, we can no longer assume Eq. (6) to be convex, as
is the case for conventional SSVMs. Although theoretical guarantees can be made for the
convergence of (sub)gradient methods for convex functions [37], and particular classes of
nonconvex functions [38], no such guarantees can be made for arbitrary nonconvex functions [39].
The problem of optimizing highly nonconvex functions is studied extensively in neural network
gradient descent literature, where it has been demonstrated that nonconvex objectives can be
minimized effectively due to the high dimensionality of the neural network parameter space [?].
Dauphin et al. [41] show that saddle points are
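Algorithm 2: Integrated SSVM subgradient descent with both unary and interaction neural factors
Input: # iterations $T$; learning rate; inverse regularization strength $\lambda$; training set $\{(x^n, y^n)\}$
Output: optimized parameters $\theta$ and $\gamma$

 1  Initialize $\theta$ and $\gamma$ according to [28]; the weights of the output layers are initialized to 0.
 2  for $1 \leq t \leq T$ do
 3      for $1 \leq n \leq N$ do
 4          $z^n \leftarrow \arg\max_{y \in \mathcal{Y}} [\Delta(y^n, y) + f(x^n, y; \theta) + h(x^n, y; \gamma)]$
 5          if $\Delta(y^n, z^n) - g(x^n, y^n; \theta, \gamma) + g(x^n, z^n; \theta, \gamma) > 0$ then
 6              $\nabla_\theta L_n(\theta, \gamma) \leftarrow \theta + \lambda(\nabla_\theta f(x^n, z^n; \theta) - \nabla_\theta f(x^n, y^n; \theta))$
 7              and $\nabla_\gamma L_n(\theta, \gamma) \leftarrow \gamma + \lambda(\nabla_\gamma h(x^n, z^n; \gamma) - \nabla_\gamma h(x^n, y^n; \gamma))$
 8          else
 9              $\nabla_\theta L_n(\theta, \gamma) \leftarrow \theta$ and $\nabla_\gamma L_n(\theta, \gamma) \leftarrow \gamma$
10          end
11      end
12      $\theta \leftarrow \mathrm{backprop}\big(\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta L_n(\theta, \gamma)\big)$ and $\gamma \leftarrow \mathrm{backprop}\big(\frac{1}{N} \sum_{n=1}^{N} \nabla_\gamma L_n(\theta, \gamma)\big)$
13  end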


much likelier than local minima in multilayer neural network objective landscapes. In particular,
the ratio of saddle points to local minima increases exponentially with the parameter
dimensionality. Several methods exist to avoid these saddle points, e.g., momentum [42].
Furthermore, Dauphin et al. [41] show, based on random matrix theory, that the existing local
minima are very close to the global minimum of the objective function. This can be understood
intuitively, as the probability that all directions surrounding a local minimum lead upwards is
very small, making local minima not an issue in general. The empirical results presented in
Section 4 reinforce this belief by demonstrating that the regularized objective function can
still be minimized effectively, as we achieve accurate predictions.

As described in Algorithm 1, the (sub)gradient is defined over whole data samples, which each
consist of multiple nodes. $f$ thus models the unary part of the compatibility function $g$,
which is a sum of the $V_n$ unary factors. Therefore, the function $f(x, y; \theta)$ decomposes
as a sum of neural unary factors

$$f(x, y; \theta) = \sum_{i=1}^{V_n} f^*(x_i^U; \theta)_{y_i}, \tag{11}$$

with $x_i^U$ the unary features in $x$. The nonlinear function
$f^* : \mathcal{X} \to \mathbb{R}^{|\mathcal{L}|}$ is a multiclass multilayer neural network
parametrized by $\theta$, whose inputs are features corresponding to the $V_n$ different nodes.
It forms a template for the neural unary factors. In this network $f^*(x_i^U; \theta)$, the
softmax-function is removed from the output layer, such that it matches the (unbounded) range of
the unary factors. The argument $y$ of the joint feature function is used as an index $y_i$ to
select a particular output unit.
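The interplay of loss-augmented prediction and back-propagation can be illustrated with a small NumPy sketch for a single training sample. It is a simplification under explicit assumptions, not the implementation used in this paper: interaction factors are left out so that loss-augmented inference is exact and decomposes per node, the loss is the unweighted Hamming distance, a plain gradient step replaces the momentum-based update, and all function and parameter names (`init_params`, `forward`, `integrated_step`, `lam`, `lr`) are hypothetical.

```python
import numpy as np


def init_params(n_in, n_hidden, n_labels, rng):
    """One-hidden-layer unary factor network; the output layer starts at zero."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "W2": np.zeros((n_labels, n_hidden)),
        "b2": np.zeros(n_labels),
    }


def forward(params, X):
    """Return hidden activations and raw factor values (no softmax on the output)."""
    H = np.tanh(X @ params["W1"].T + params["b1"])
    F = H @ params["W2"].T + params["b2"]
    return H, F


def integrated_step(params, X, y, lam, lr):
    """One update on a single sample; returns the hinge value before the update."""
    idx = np.arange(len(y))
    H, F = forward(params, X)

    # Loss-augmented prediction: add 1 to every label except the ground truth.
    augmented = F + 1.0
    augmented[idx, y] -= 1.0
    z = augmented.argmax(axis=1)

    # Hinge term: Delta(y, z) - f(x, y) + f(x, z).
    margin = np.sum(z != y) - F[idx, y].sum() + F[idx, z].sum()

    # The regularizer contributes theta itself to the gradient.
    grads = {name: value.copy() for name, value in params.items()}
    if margin > 0:
        # Output deltas: +1 at the predicted unit, -1 at the ground-truth unit.
        D = np.zeros_like(F)
        D[idx, z] += 1.0
        D[idx, y] -= 1.0
        dA = (D @ params["W2"]) * (1.0 - H ** 2)   # back-propagate through tanh
        grads["W2"] += lam * (D.T @ H)
        grads["b2"] += lam * D.sum(axis=0)
        grads["W1"] += lam * (dA.T @ X)
        grads["b1"] += lam * dA.sum(axis=0)

    for name in params:
        params[name] -= lr * grads[name]
    return max(float(margin), 0.0)
```

The output deltas are nonzero only where the loss-augmented prediction disagrees with the ground truth, so correctly labeled nodes with sufficient margin do not drive the network update.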


3.3. Neural interaction factors 

In this section we extend the notion of nonlinear factors beyond the integration of the training
of a unary classifier. We now also replace the linear interaction part
$\langle w, \varphi_I(x, y) \rangle$ of the compatibility function $g$ with a function
$h(x, y; \gamma)$ that decomposes as a sum of neural interaction factors

$$h(x, y; \gamma) = \sum_{i=1}^{E_n} h^*(x_i^I; \gamma)_{\mathcal{N}_i(y)}, \tag{12}$$

with $x_i^I$ the interaction features in $x$, $\mathcal{N}_i(y)$ the combination of node labels
in the $i$-th interaction, and $E_n$ the number of interactions in the $n$-th training sample.
The function $h^* : \mathcal{X} \to \mathbb{R}^{|\mathcal{L}|^Q}$ is parametrized by $\gamma$,
and forms a template for the interaction factors. Herein, $Q$ depends on the interaction order,
e.g., $Q = 2$ in the Section 4 use case as connections between nodes are then edges. Interaction
factors are generally not trained upfront. However, neural interaction factors are useful as they
can extract more complex interaction patterns, and thus transcend the limited generalization
power of linear combinations. In image segmentation for example, interaction features consisting
of vertical gradients and a 90°-angle can indicate that the two connected nodes belong to the
same class. The loss-augmented inference step in Eq. (9) is now adapted to

$$z^n = \arg\max_{y \in \mathcal{Y}} \big[\Delta(y^n, y) + f(x^n, y; \theta) + h(x^n, y; \gamma)\big], \tag{13}$$

while the compatibility function becomes
$g(x, y; \theta, \gamma) = f(x, y; \theta) + h(x, y; \gamma)$. The two distinct models $f$ and
$h$ are trained in a similar fashion to the method described in Algorithm 1, as depicted in
Algorithm 2. Notice that this method can easily be adjusted for batch or online learning by
adapting and moving the weight updates at line 12 into the inner loop.

Like the unary function $f^*$ in Eq. (11), $h^*$ is a multiclass multilayer neural network in
which the top softmax-function is removed, shared among all interaction factors. The output layer
dimension matches the number of interaction label combinations, in the most general case. For
example in image segmentation, for a problem with symmetric edge features, the number of output
units in $h^*$ is $\frac{1}{2}|\mathcal{L}|(|\mathcal{L}| + 1)$, which all represent different
states for a particular interaction factor (in this case the interactions are undirected edges,
thus $\mathcal{N}_i(y)$ consists of the $i$-th edge’s incident nodes).
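For undirected edges, an unordered pair of labels has to be mapped onto one of the output units of the interaction network. One possible indexing scheme, given here purely as an illustrative assumption (the paper does not specify the ordering of the output units), enumerates the upper triangle of the label-pair matrix row by row:

```python
def n_symmetric_states(n_labels):
    """Number of unordered label pairs, including pairs of identical labels."""
    return n_labels * (n_labels + 1) // 2


def symmetric_pair_index(a, b, n_labels):
    """Map an unordered label pair {a, b} to a unique index in the upper triangle."""
    lo, hi = (a, b) if a <= b else (b, a)
    # Rows 0 .. lo-1 of the upper triangle hold n_labels, n_labels-1, ... entries.
    row_offset = lo * n_labels - lo * (lo - 1) // 2
    return row_offset + (hi - lo)
```

For the 21 classes of MSRC-21 this yields 231 output units instead of the 441 needed for ordered pairs.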

The resulting structured predictor no longer requires two-phase training in which linear
interaction factors are combined with the upfront training of a unary classifier, whose output is
transformed linearly into unary factor values. It makes use of highly nonlinear functions for all
SSVM factors, by way of multilayer neural networks, using an integration of loss-augmented
inference and back-propagation in a subgradient descent framework. This allows the factors to
generalize strongly while being able to mutually adapt to each other’s parameter updates, leading
to more accurate predictions.


4. Experiments 

In this section, our model is analyzed on the task of image segmentation. Herein, the goal is to
label different image regions with a correct class label. This is cast into a structured
prediction problem by predicting all image region class labels simultaneously. There is one unary
factor in the underlying SSVM graphical structure for every image region, while interactions
represent edges between neighboring regions. First, our model is analyzed and its different
variants are compared to conventional SSVM training schemes. Second, the best performing variant
is compared with state-of-the-art segmentation approaches. Our model is implemented as an
extension of PyStruct [?], using Theano [44] for GPU-accelerated neural factor optimization.

4.1. Experimental setup

The model analysis experiments are executed on the widely-used MSRC-21 benchmark [?], which
consists of 276 training, 59 validation, and 256 testing images. This benchmark is sufficiently
complex with its 21 classes and noisy labels, and focuses on object delineation as well as
irregular background recognition. Furthermore, the experiments are executed on the KITTI
benchmark, consisting of 100 training and 46 testing images, augmented with 49 training images of
Kundu et al. [47]. This latter benchmark consists of 11 classes, but we drop the 3 least
frequently occurring ones as they are insufficiently represented in the dataset. Finally, the
same experiment is repeated for a larger dataset, namely the SIFT Flow benchmark [?], consisting
of 33 classes with 2488 training and 200 testing images.

All image pixels are clustered into ±300 regions using the SLIC [49] superpixel algorithm. For
each region, gradient (DAISY [?]) and color (in HSV-space) features are densely extracted. These
features are transformed two times into separate bags-of-words via minibatch k-means clustering
(once 60 gradient and 30 color words, once 10 and 5 words). The unary input vectors form
(60 + 30)-D concatenations of the first two bags-of-words. The model’s connectivity structure
links together all neighboring regions via edges. The edge/interaction input vectors are based on
concatenations of the second set of bags-of-words. Both (10 + 5)-D input vectors of the edge’s
incident regions are concatenated into a (2 × (10 + 5))-D vector. Moreover, two edge-specific
features are added, namely the distance and angle between adjacent superpixel centers, leading to
(2 × (10 + 5) + 2)-D interaction feature vectors.
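The construction of a 32-dimensional interaction feature vector can be sketched as follows. The function name is hypothetical, and the exact definitions of distance (Euclidean) and angle (`arctan2` of the center offset) are assumptions, since the text only states that distance and angle between superpixel centers are used.

```python
import numpy as np


def edge_features(bow_i, bow_j, center_i, center_j):
    """Concatenate both regions' (10 + 5)-D bags-of-words with distance and angle."""
    offset = np.asarray(center_j, dtype=float) - np.asarray(center_i, dtype=float)
    distance = np.hypot(offset[0], offset[1])   # assumed: Euclidean distance
    angle = np.arctan2(offset[1], offset[0])    # assumed: orientation of the offset
    return np.concatenate([bow_i, bow_j, [distance, angle]])
```

With two 15-D bags-of-words as input, the result has 2 × 15 + 2 = 32 entries.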

Factors are trained with (regular) momentum, using a learning rate curve $\mu/(t_0 + t)$, with
$\mu$ and $t_0$ parameters, and $t$ the current training iteration number as used in Algorithms 1
and 2. The regularization, learning rate, and momentum hyperparameter values are tuned using a
validation set by means of a coarse- and fine-grained grid search over the parameter spaces,
yielding separate settings for the unary and pairwise factors. The linear parameters $w$ are
initialized to 0, while the neural factor parameters $\theta$ and $\gamma$ are initialized
according to [28], except for the top layer weights which are set to 0. The class weights
$\eta(y_i^n)$ in Eq. (5) are set to correct for class imbalance. The model is trained using
CPU-parallelized loss-augmented prediction, while the neural factors are trained using GPU
parallelism.
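A decaying learning rate combined with momentum can be sketched as below. The classical momentum form (a velocity that accumulates scaled negative gradients) is an assumption about what "regular" momentum refers to, and the function names are illustrative only.

```python
def learning_rate(mu, t0, t):
    """Learning rate curve mu / (t0 + t) at training iteration t."""
    return mu / (t0 + t)


def momentum_update(param, velocity, grad, mu, t0, t, momentum):
    """One classical momentum step; returns the new parameter and velocity."""
    velocity = momentum * velocity - learning_rate(mu, t0, t) * grad
    return param + velocity, velocity
```

The same functions work elementwise on NumPy arrays, so they can be applied to whole weight matrices.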

The following models are compared: unary-only (unary), N-slack cutting plane training (CP) with
delayed constraint generation, subgradient descent (SGD)³, integrated training with neural unary
and linear interaction factors (int+lin), bifurcated training with neural interaction factors
(bif+nrl), and integrated training with neural unary and neural interaction factors (int+nrl).

Multiclass logistic regression is used as unary classifier, trained with gradient descent by
cross-entropy optimization. All unary neural factors contain a single hidden layer with 256
tanh-units, for direct comparison of integrated learning with upfront logistic regression
training. The interaction neural factors contain a single hidden layer of 512 tanh-units to
elucidate the benefit of nonlinear factors, without overly increasing the model’s capacity. The
experiment is set up to highlight the benefit of integrated learning by restricting the unary
factors to features insufficiently discriminative on their own. This deliberately leads to noisy
unary classification, forcing the model to rely on contextual relations for accurate prediction.
The interaction factors encode information about their incident region


³SGD uses bifurcated training with linear interactions, hence it could be named bif+lin.
Figure 1: Illustrative examples of the performance of SGD and int+nrl on several MSRC-21 test images. Integrated
training with neural factors improves classification accuracy over subgradient descent. The last column presents a case 
in which our model fails to outperform SGD. 



Figure 2: Illustrative examples of the performance of SGD and int+nrl on several KITTI test images. Integrated training 
with neural factors improves classification accuracy over subgradient descent. The last column presents a case in which 
our model fails to outperform SGD. 



Figure 3: Illustrative examples of the performance of SGD and int+nrl on several SIFT Flow test images. Integrated 
training with neural factors improves classification accuracy over subgradient descent. The last column presents a case 
in which our model fails to outperform SGD. 
Table 1: MSRC-21 class, pixel-wise, and class-mean test accuracy (in %) for different models

Table 2: KITTI class, pixel-wise, and class-mean test accuracy (in %) for different models

Table 3: SIFT Flow pixel-wise and class-mean test accuracy (in %) for different models

              pixel   class
unary          44.7     7.5
CP             62.5    13.8
SGD            65.9    15.3
int+lin        68.8    16.1
int+nrl        71.3    17.0
inff -l-lin    70.2    15.6
3-layer        71.5    17.2


feature vectors to allow neural factors to extract meaningful patterns from gradient/color
combinations. We deliberately encoded less information in the interaction features, such that the
model cannot solely rely on interaction factors for accurate and coherent predictions.

4.2. Results and discussion

Accuracy results on the MSRC-21 [?] test images are presented in Table 1, while Figure 1 shows a
handful of illustrative examples that compare segmentations attained by SGD with int+nrl. The
results of the same experiment for the KITTI benchmark [35], augmented with additional training
images of Kundu et al. [47], are shown in Table 2 and Figure 2. Qualitative results on the SIFT
Flow [?] dataset are shown in Figure 3, while accuracy results are shown in Table 3.

The results show that unary-only prediction is very inaccurate (pixel-wise/class-mean accuracy
of 36.3/23.1% for the MSRC-21 dataset, 53.8/42.8% for the KITTI dataset, and 44.7/7.5% for the
SIFT Flow dataset). The reason is that the unary features, due to their low dimensionality, are
not sufficiently distinctive to differentiate between classes. Accurate predictions are only
possible by taking into account contextual output relations, as demonstrated by the increased
accuracy of CP (MSRC-21: 59.4/48.5%; KITTI: 61.5/46.7%; SIFT Flow: 62.5/13.8%) as well as SGD
(MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%). These structured predictors
learn linear relations between image regions, which allows them to correct errors originating
from the underlying unary classifier. However, the unary factor's linear weights w have only
limited capability for error correction in the opposite direction, because the SSVM cannot
alter the unary classifier parameters post hoc.
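
The two numbers reported per model can be read as follows, assuming the usual definitions (the evaluation code itself is not part of this text): pixel-wise accuracy is the fraction of correctly labeled pixels, and class-mean accuracy is the average of the per-class recalls. The sketch below computes both from a confusion matrix; the function name and the toy matrix are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def segmentation_accuracies(confusion):
    """Return (pixel-wise, class-mean) accuracy from a confusion matrix.

    confusion[i, j] counts pixels of true class i predicted as class j.
    Classes absent from the ground truth are skipped in the class mean
    (an assumption; other conventions exist).
    """
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    per_class_total = confusion.sum(axis=1)
    pixel_acc = correct.sum() / confusion.sum()
    present = per_class_total > 0
    class_mean_acc = np.mean(correct[present] / per_class_total[present])
    return pixel_acc, class_mean_acc

# Hypothetical 3-class example dominated by one frequent class.
toy_confusion = np.array([
    [90, 5, 5],   # frequent class, mostly correct
    [8, 2, 0],    # rare class, mostly confused with class 0
    [6, 0, 4],    # rare class
])
pixel_acc, class_mean_acc = segmentation_accuracies(toy_confusion)
print(f"pixel-wise: {pixel_acc:.2f}, class-mean: {class_mean_acc:.2f}")
```

Under these definitions, a model can reach a high pixel-wise accuracy while rare classes are mostly misclassified, which may explain part of the large gap between the two SIFT Flow figures.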

Figure 4: Visualization of the synergy between unary and interaction factors. In bifurcated
training, the interactions make unary factors redundant, as these cannot adapt to errors made
by the interactions. In integrated training, combining both factor types leads to a higher
accuracy, as they can mutually adapt to each other's weight updates.

Table 4: State-of-the-art comparison: MSRC-21 per-class, class-mean, and global pixel-wise test
accuracy (in %) for different models

Using an integrated training approach such as int+lin, in which the SSVM is trained end-to-end,
improves accuracy (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow: 70.3/16.2%) over the
bifurcated procedures CP and SGD. Although neither the unary nor the interaction features are
very distinctive, the integrated procedure updates parameters in such a way that both factor
types have a unique discriminative focus. Their synergistic relationship ultimately results in
higher accuracy. To better compare SGD (which uses 8, 21, and 33 logistic regression outputs as
unary input features for the different benchmarks) with int+lin, we also report the accuracy
(MSRC-21: 61.2/51.7%; KITTI: 69.2/53.5%; SIFT Flow: 70.2/15.6%) of a model (int*+lin) with only
8, 21, and 33 unary hidden units for the KITTI, MSRC-21, and SIFT Flow datasets, respectively,
rather than 256 units. The increases in accuracy over SGD of 2.0/2.1% (MSRC-21), 3.7/2.9%
(KITTI), and 4.3/0.3% (SIFT Flow) further illustrate the benefit of integrated learning and
inference over conventional bifurcated SSVM training.
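
The quoted gains over SGD are plain differences in percentage points between the accuracies of the reduced integrated model (8, 21, and 33 unary hidden units) and those of SGD. The short check below recomputes them from the figures given in the text; the variable names are illustrative.

```python
# Pixel-wise / class-mean test accuracies (in %) as quoted in the text.
sgd = {"MSRC-21": (59.2, 49.6), "KITTI": (65.5, 50.6), "SIFT Flow": (65.9, 15.3)}
reduced_int_lin = {"MSRC-21": (61.2, 51.7), "KITTI": (69.2, 53.5), "SIFT Flow": (70.2, 15.6)}

# Difference per dataset, rounded to one decimal to avoid floating-point noise.
gains = {
    name: tuple(round(new - old, 1) for new, old in zip(reduced_int_lin[name], sgd[name]))
    for name in sgd
}
for name, (pixel_gain, class_gain) in gains.items():
    print(f"{name}: +{pixel_gain}/+{class_gain} percentage points")
```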

Another insight gained from the results is that accuracy increases when the linear interaction
factors of conventional SSVMs are replaced with neural factors: int+nrl (MSRC-21: 70.1/62.3%;
KITTI: 75.6/60.9%; SIFT Flow: 71.3/17.0%) and bif+nrl (MSRC-21: 62.7/53.7%; KITTI: 70.0/55.9%;
SIFT Flow: 68.8/16.1%) outperform int+lin (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow:
70.3/16.2%) and SGD (MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%),
respectively. This increase can be attributed to the higher number of parameters, as well as
the added nonlinearities in combination with appropriate regularization. The model has greater
generalization power, allowing the factors to extract more complex and meaningful interaction
patterns. Neural factors offer great flexibility, as they can be stacked to arbitrary depths.
This leads to even higher generalization, as indicated by the increased accuracy (MSRC-21:
71.6/65.1%; KITTI: 77.6/63.6%; SIFT Flow: 71.5/17.2%) of the deeper 3-layer (int+nrl) model.
Herein, both unary and interaction factors are 3-hidden-layer neural networks with 256 and 512
units per layer, respectively (rectified linear units for MSRC-21 and KITTI and tanh units for
SIFT Flow). Our model can thus easily be extended, for example by letting neural factors
represent the fully-connected layer in convolutional neural networks. As such, it serves as a
foundation for more complex structured models.
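
To make the notion of a stacked neural factor concrete, the following is a minimal NumPy sketch of the forward pass of a factor with three hidden layers of rectified linear units that maps a feature vector to one value per label. The 256 hidden units follow the unary setting mentioned in the text; the input dimension, the number of labels, the initialization, and all function names are illustrative assumptions, and training is not shown.

```python
import numpy as np

def init_factor(n_in, n_hidden, n_layers, n_out, rng):
    """Create (weights, bias) pairs for a fully connected neural factor."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
         np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

def factor_values(params, features):
    """Forward pass: ReLU hidden layers, linear output (one value per label)."""
    h = features
    for weights, bias in params[:-1]:
        h = np.maximum(0.0, h @ weights + bias)
    weights, bias = params[-1]
    return h @ weights + bias

rng = np.random.default_rng(0)
# Hypothetical sizes: 64 input features, 21 labels.
unary_factor = init_factor(n_in=64, n_hidden=256, n_layers=3, n_out=21, rng=rng)
region_features = rng.normal(size=(5, 64))               # 5 hypothetical regions
scores = factor_values(unary_factor, region_features)    # shape (5, 21)
```

An interaction factor could be sketched the same way with 512 hidden units and one output per label pair, although the exact output parameterization used in the experiments is not stated in this section.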

All methods converge within 600 epochs. For the int+nrl algorithm, one epoch takes
approximately 12.62 seconds for the MSRC-21 dataset, 4.35 seconds for the KITTI dataset, and
197.27 seconds for the SIFT Flow dataset. Since the implementation of our algorithm is not
optimized for speed, these values can be further reduced by better exploitation of CPU
parallelism.

Figure 4 illustrates the synergy between unary and interaction factors achieved through both
integrated and bifurcated training, evaluated on the MSRC-21 dataset. The bars depict model
test accuracy when using only unary or only pairwise factors, obtained by setting the pairwise
or unary factors, respectively, to a zero factor value, thus
$\langle w, \psi_I(x, y) \rangle = 0$ or $\langle w, \psi_U(x, y) \rangle = 0$,
$\forall y \in \mathcal{Y}$. Although the unary factors alone perform well in bifurcated
training, nearly all accuracy can be attributed to the interactions. A possible explanation is
that both types essentially learn the same information. The interactions correct errors of the
underlying classifier and ultimately make unary factors redundant. In integrated training,
neither the unary nor the interaction factors alone attain a high accuracy, but the combination
of both does.
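
The single-factor evaluation described above can be sketched on a toy problem: predict once with both factor types, once with the pairwise values set to zero, and once with the unary values set to zero. The brute-force inference, the three-region chain, and all factor values below are hypothetical and only illustrate the protocol, not the reported results.

```python
import itertools
import numpy as np

def map_labeling(unary, pairwise, edges):
    """Brute-force argmax of the summed factor values (tiny graphs only)."""
    n_nodes, n_labels = unary.shape
    best_score, best_labels = -np.inf, None
    for labels in itertools.product(range(n_labels), repeat=n_nodes):
        score = sum(unary[i, labels[i]] for i in range(n_nodes))
        score += sum(pairwise[labels[i], labels[j]] for i, j in edges)
        if score > best_score:
            best_score, best_labels = score, labels
    return np.array(best_labels)

def accuracy(pred, truth):
    """Fraction of regions with the correct label."""
    return float(np.mean(pred == truth))

# Hypothetical factor values for 3 regions and 2 labels.
unary = np.array([[1.0, 0.0], [0.4, 0.6], [1.0, 0.0]])
pairwise = np.array([[0.5, 0.0], [0.0, 0.3]])   # rewards equal neighboring labels
edges = [(0, 1), (1, 2)]
truth = np.array([0, 0, 0])

full = map_labeling(unary, pairwise, edges)
unary_only = map_labeling(unary, np.zeros_like(pairwise), edges)     # pairwise zeroed
pairwise_only = map_labeling(np.zeros_like(unary), pairwise, edges)  # unary zeroed
for name, pred in [("both", full), ("unary only", unary_only),
                   ("pairwise only", pairwise_only)]:
    print(name, pred, accuracy(pred, truth))
```

In this toy case the unary factors alone mislabel the middle region, and adding the pairwise values corrects it.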

We explain this synergistic relationship with an example. For a region of class A, the unary
factors assign the second-highest factor value to class A, the highest value to class B, and a
low value to class C. The interactions also assign the second-highest value to class A, but the
highest value to class C and a low value to class B. Independently, both factors incorrectly
predict the region of class A as belonging to class B or class C. However, when combined, they
correctly assign the highest value to class A. In the figure, bifurcated training shows only
limited signs of factor synergy, as the optimization procedure is insufficiently able to steer
unary and pairwise parameters in different directions, which causes them to have a similar
discriminative focus. This observation leads us to believe that integrated learning and
inference results in higher accuracy through synergistic unary/interaction factor optimization.
Both factor types are no longer optimized for independent accuracy, but mutually adapt to each
other's parameter updates, which results in enhanced predictive power.
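
The example can be checked with concrete numbers. The values below are made up to match the described ranking, and the interaction contribution is simplified to one value per class for the region (that is, with the neighboring labels held fixed), which is an assumption for illustration only.

```python
import numpy as np

classes = ["A", "B", "C"]
# Hypothetical factor values for one region whose true class is A.
unary_values = np.array([0.4, 0.5, 0.1])        # B highest, A second, C low
interaction_values = np.array([0.4, 0.1, 0.5])  # C highest, A second, B low

def predict(values):
    """Return the class with the highest factor value."""
    return classes[int(np.argmax(values))]

unary_prediction = predict(unary_values)                          # "B"
interaction_prediction = predict(interaction_values)              # "C"
combined_prediction = predict(unary_values + interaction_values)  # "A"
```

Each factor type alone is wrong, but their sum ranks class A first because A is the only class that both rate reasonably high.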

In addition to the previous experiments, the viability of our neural factor model is shown
through comparison with the closely related work of Liu et al. on the MSRC-21 dataset. Liu et
al. make use of features extracted from square regions of varying size around each superpixel,
by means of a pretrained convolutional neural network. We compare our model with theirs by
using OverFeat features [58] in a similar fashion, trained on individual regions. Furthermore,
the model settings have been altered with respect to the previous experiments. More
specifically, 1,000 SLIC superpixels are utilized for the over-segmentation preprocessing step,
enforcing superpixel connectivity and merging any superpixel with a surface area below a
particular threshold. DAISY gradient and HSV color features are extracted according to a
regular lattice, and clustered via minibatch k-means clustering. Next, the same type of
features is extracted for each individual pixel, leading to unary and pairwise factor feature
vectors. Moreover, the (x, y)-position of the superpixel (median-based) center is included in
the unary feature vectors, while the distance and angle between the two superpixel centers are
encoded into the interaction feature vectors. The neural factors are represented by multi-layer
neural networks using tanh units, trained according to our algorithm using conventional
momentum and single image-sized batches per gradient update. Classes are balanced by weighing
them with the inverse of the class frequency. The results are presented in Table 4 and indicate
that our model is capable of performing on par with the current state of the art when used in
conjunction with more advanced methods, e.g., OverFeat features. Moreover, similar to Liu et
al., we have compared our model with other, less closely related methods for completeness;
these results are shown below the horizontal line in Table 4.
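
The class balancing step mentioned above can be sketched as follows. This is one plausible reading of "weighing with the inverse of the class frequency"; the function name, the handling of absent classes, and the toy labels are assumptions, and the text does not specify where in the loss the weights enter.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by the inverse of its relative frequency.

    Classes that never occur get weight 0 (an assumption; the text does not
    say how absent classes are handled).
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    freq = counts / counts.sum()
    weights = np.zeros(n_classes)
    present = counts > 0
    weights[present] = 1.0 / freq[present]
    return weights

# Hypothetical superpixel labels in which class 0 dominates.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
weights = inverse_frequency_weights(labels, n_classes=4)
per_region_weight = weights[labels]   # could scale each region's loss term
```

With these weights, errors on the rare classes 1 and 2 count three times as much as errors on the dominant class 0.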

5. Conclusion 

A structured prediction model that integrates back-propagation and loss-augmented inference
into subgradient descent training of structural support vector machines (SSVMs) is proposed.
This model departs from the traditional bifurcated approach in which a unary classifier is
trained independently from the structured predictor. Furthermore, the SSVM factors are extended
to neural factors, which allows both unary and interaction factors to be highly nonlinear
functions of input features. Results on a complex image segmentation task show that end-to-end
SSVM training, and/or using neural factors, leads to more accurate predictions than
conventional subgradient descent and N-slack cutting plane training. The results indicate that
our model can serve as a foundation for more advanced structured models, e.g., by using latent
variables, learned feature representations, or more complex connectivity structures.